lfq <- read.csv('./data/nonimputed_lfq.csv')

# Split the samples into knockout (KO) and wild-type (WT) columns
LFQ_KO <- lfq %>%
  select(contains('KO'))
LFQ_WT <- lfq %>%
  select(contains('WT'))

# Intensity histograms for both groups, stacked
# (plothist() is a custom helper defined elsewhere in this analysis)
ggarrange(
  plothist(LFQ_KO, '- KO'),
  plothist(LFQ_WT, '- WT'),
  nrow = 2, ncol = 1
)
# skewness of nonimputed data
moments::skewness(lfq, na.rm = TRUE)
## KO.22 KO.23 KO.24 WT.22 WT.23 WT.24
## 0.6754195 0.6725489 0.5218363 0.7211065 0.6383786 0.7043256
# skewness of imputed data
lfq.imp <- read.csv('./data/LFQ_raw_totals_imp.csv')
moments::skewness(lfq.imp)
## KO_TOTALS_22 KO_TOTALS_23 KO_TOTALS_24 WT_TOTALS_22 WT_TOTALS_23 WT_TOTALS_24
## 0.5761698 0.5868388 0.5206581 0.6285968 0.5813767 0.6579983
The imputation doesn’t have a meaningful influence on the skewness of the data. 🕺 If anything, it improves it somewhat, since the skewness coefficients are lower for the imputed data.
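As a reminder of what this coefficient measures: sample skewness (as returned by moments::skewness) is the third standardized moment. A minimal base-R check on toy numbers (values are illustrative only):

```r
# skewness = E[(x - mean)^3] / (E[(x - mean)^2])^(3/2),
# i.e. the third standardized moment -- positive for right-skewed data
x <- c(1, 2, 2, 3, 10)   # toy right-skewed values
m <- mean(x)
skew <- mean((x - m)^3) / (mean((x - m)^2)^(3/2))
round(skew, 3)           # 1.361 -> clearly right-skewed
# moments::skewness(x) computes the same quantity
```

Our LFQ samples sit around 0.5-0.7, i.e. moderately right-skewed, which is typical for intensity data even after log transformation.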
protti library
library(protti)
Thanks to Mateusz, we had data almost ready for analysis with protti, but some important columns were missing. So, based on Matis’ code, I transformed and selected features following the Input Preparation Workflow from the protti documentation.
# Read the MaxQuant proteinGroups output and drop decoys/contaminants
proteingroups <- readr::read_tsv("data/proteinGroups.txt", show_col_types = FALSE)
proteingroups <- proteingroups %>%
  filter(is.na(`Only identified by site`),
         is.na(Reverse),
         is.na(`Potential contaminant`))

to_protti <- proteingroups %>%
  select(`Protein IDs`, `Peptide sequences`,
         contains('LFQ intensity') & contains('TOTALS') &
           (ends_with('22') | ends_with('23') | ends_with('24'))) %>%
  # Replace the long protein ID strings with short labels
  mutate(`Protein IDs` = paste0('prot_', row_number())) %>%
  # Wide -> long: one row per protein x sample
  pivot_longer(3:8, names_to = 'Sample', values_to = 'Intensity') %>%
  mutate(Sample = gsub('LFQ intensity ', '', Sample)) %>%
  mutate(Sample = gsub('_TOTALS_', '_', Sample)) %>%
  separate(col = Sample, into = c("celltype", "rep"), sep = "_", remove = FALSE) %>%
  mutate(Condition = ifelse(celltype == 'KO', 'treated', 'control'),
         # Zero intensities are missing values; log2-transform the rest
         Intensity = ifelse(Intensity == 0, NA, log2(Intensity))) %>%
  select(Sample, `Protein IDs`, `Peptide sequences`, Condition, Intensity)
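As a sanity check on the reshaping above, here is the same wide-to-long pivot plus the zero-to-NA/log2 step on a two-protein toy table (the column names mimic the MaxQuant ones; the values are made up):

```r
library(dplyr)
library(tidyr)

toy <- data.frame(
  protein = c("prot_1", "prot_2"),
  `LFQ intensity KO_TOTALS_22` = c(4, 0),    # 0 = not quantified
  `LFQ intensity WT_TOTALS_22` = c(8, 16),
  check.names = FALSE
)

toy_long <- toy %>%
  pivot_longer(-protein, names_to = "Sample", values_to = "Intensity") %>%
  mutate(Sample = gsub("LFQ intensity ", "", Sample),
         Sample = gsub("_TOTALS_", "_", Sample),
         Intensity = ifelse(Intensity == 0, NA, log2(Intensity)))

toy_long
# prot_1 gets log2 intensities 2 and 3; prot_2's zero becomes NA
```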
We can do most of the calculations from the protti library with data structured as above. protti gives us a function to normalise the data, but it offers only the median method: the normalised value is the original intensity minus the run median plus the global median.
normalised_data <- to_protti %>%
protti::normalise(sample = Sample,
intensity_log2 = Intensity,
method = 'median')
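This median rule is easy to verify by hand. A base-R sketch on toy intensities (numbers made up) shows that after normalisation every run's median equals the global median:

```r
toy <- data.frame(
  Sample    = rep(c("A", "B"), each = 3),
  Intensity = c(10, 12, 14, 20, 22, 24)
)

global_median <- median(toy$Intensity)                         # 17
run_median    <- ave(toy$Intensity, toy$Sample, FUN = median)  # 12 and 22

# original intensity - run median + global median
toy$norm <- toy$Intensity - run_median + global_median

# both runs now share the same median:
tapply(toy$norm, toy$Sample, median)   # A = 17, B = 17
```

Note that this only shifts each run; it does not change the shape of the distributions, which is why the skewness per sample below is unchanged.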
normalised_data %>%
group_by(Sample) %>%
summarise(skewness = moments::skewness(normalised_intensity_log2, na.rm=TRUE))
## # A tibble: 6 × 2
## Sample skewness
## <chr> <dbl>
## 1 KO_22 0.675
## 2 KO_23 0.673
## 3 KO_24 0.522
## 4 WT_22 0.721
## 5 WT_23 0.638
## 6 WT_24 0.704
The library gives us two methods for imputation:

- method = "ludovic": MNAR missingness is sampled from a normal distribution centred on a value three units (log2) lower than the lowest intensity recorded for the precursor/peptide, with a spread equal to the mean standard deviation for the precursor/peptide.
- method = "noise": MNAR missingness is sampled from a normal distribution around the mean noise for the precursor/peptide, with a spread equal to the mean standard deviation (from each condition) for the precursor/peptide.

Before imputation, we have to assign missingness to the rows in our data. We can do that with the assign_missingness() function from the protti library. This function returns the input dataframe extended by:

- a comparison column, which contains the comparison name for the specific treatment/reference pair.
- a missingness column, which reports the type of missingness.

Types of missingness:

- "complete": no missing values in any replicate of this reference/treatment pair for the specific grouping variable.
- "MNAR": missing not at random. All replicates of either the reference or the treatment condition have missing values for the specific grouping variable.
- "MAR": missing at random. At least n-1 replicates have missing values for the reference/treatment pair for the specific grouping variable.
- NA: the comparison is not complete enough to fall into any other category; it will not be imputed if imputation is performed.

# Assigning missing rows
data_missing <- normalised_data %>%
assign_missingness(sample=Sample,
condition = Condition,
grouping = `Protein IDs`,
intensity = normalised_intensity_log2,
ref_condition = 'all')
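The categories above can be illustrated with a hand-rolled toy classifier applied per condition (simplified: protti's assign_missingness additionally applies completeness cutoffs that we ignore here, and classify() is a hypothetical helper, not a protti function):

```r
# Classify one condition's replicate vector by how many values are missing
classify <- function(x) {
  n_missing <- sum(is.na(x))
  if (n_missing == 0)              "complete"
  else if (n_missing == length(x)) "MNAR"   # whole condition missing
  else                             "MAR"    # partially missing
}

classify(c(21.3, 21.9, 22.0))  # "complete"
classify(c(21.3, NA, 22.0))    # "MAR"
classify(c(NA, NA, NA))        # "MNAR"
```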
We can see that the data still contains a lot of missing rows.
ludovic
ludovic_imp <- impute(
data_missing,
sample = Sample,
grouping = `Protein IDs`,
intensity_log2 = normalised_intensity_log2,
condition = Condition,
comparison = comparison,
missingness = missingness,
method = 'ludovic',
skip_log2_transform_error = TRUE
)
The imputation is done, but some rows have too much missing data for the function to impute a value there. The conclusion is that we are still left with many missing values for peptides that never recorded an LFQ value.
noise
noise_imp <- impute(
data_missing,
sample = Sample,
grouping = `Protein IDs`,
intensity_log2 = normalised_intensity_log2,
condition = Condition,
comparison = comparison,
missingness = missingness,
method = 'noise',
skip_log2_transform_error = TRUE
)
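The two sampling rules can be sketched for a single MNAR peptide (toy numbers on the log2 scale; noise_level is a hypothetical mean-noise value, and the real impute() applies these rules per precursor and condition):

```r
set.seed(123)
obs    <- c(22.1, 21.8, NA)       # one MNAR replicate (log2 scale)
spread <- sd(obs, na.rm = TRUE)   # stand-in for the mean standard deviation

# method = 'ludovic': centre 3 (log2) below the lowest observed intensity
ludovic_centre <- min(obs, na.rm = TRUE) - 3
imp_ludovic    <- rnorm(1, mean = ludovic_centre, sd = spread)

# method = 'noise': centre on the peptide's mean noise
noise_level <- 17.5               # hypothetical mean noise
imp_noise   <- rnorm(1, mean = noise_level, sd = spread)
```

Both rules draw imputed values well below the observed intensities, which matches the MNAR assumption that the peptide was missed because it was low-abundant.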
The ludovic imputation method yields a lower skewness coefficient. Importantly, both imputation methods decrease the skewness coefficient.
DIFFERENTIAL ABUNDANCE AND SIGNIFICANCE
In the tutorial, differential abundance and significance are calculated using data prepared as above. We can run the calculations on data with missing values or on imputed data. As in the tutorial, we use the moderated t-test to calculate this statistic.
All methods:

- t-test: Welch's t-test.
- t-test_mean_sd: Welch's t-test performed on means, standard deviations and numbers of replicates.
- moderated_t-test: a moderated t-test based on the limma package.
- proDA: infers means across samples based on a probabilistic dropout model; this eliminates the need for data imputation, since missing values are inferred from the model.

result <- ludovic_imp %>%
calculate_diff_abundance(sample=Sample,
condition = Condition,
grouping = `Protein IDs`,
intensity_log2 = imputed_intensity,
missingness = missingness,
comparison = comparison,
filter_NA_missingness = TRUE,
method = 'moderated_t-test')
The result can be visualised with a volcano plot, which can even be made interactive, but we can't adapt the built-in one to our needs. Below is a manually generated volcano plot showing the significant proteins split by type of missingness, which also determines the type of imputation applied.
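A minimal ggplot2 sketch of such a plot on a toy result table (the diff and adj_pval column names follow protti's calculate_diff_abundance output; the missingness column and all values here are illustrative, assuming missingness has been joined back onto the result):

```r
library(ggplot2)

toy_result <- data.frame(
  diff        = c(-2.5, 0.1, 3.0, -0.4),     # log2 fold changes
  adj_pval    = c(0.001, 0.8, 0.0005, 0.6),
  missingness = c("MNAR", "complete", "MAR", "complete")
)

ggplot(toy_result, aes(x = diff, y = -log10(adj_pval), colour = missingness)) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +  # significance cutoff
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +      # fold-change cutoff
  labs(x = "log2 fold change", y = "-log10 adjusted p-value")
```

With these toy cutoffs, the two points in the upper-left and upper-right corners would count as significantly changed.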
# Per-condition mean, sd and replicate count, as required by the
# t-test_mean_sd method below
ludovic_imp <- ludovic_imp %>%
  group_by(Condition) %>%
  mutate(mean = mean(imputed_intensity, na.rm = TRUE),
         sd = sd(imputed_intensity, na.rm = TRUE),
         n_samples = n()) %>%
  ungroup()
result <- ludovic_imp %>%
calculate_diff_abundance(sample=Sample,
condition = Condition,
grouping = `Protein IDs`,
intensity_log2 = imputed_intensity,
missingness = missingness,
comparison = comparison,
filter_NA_missingness = TRUE,
method = 't-test_mean_sd',
mean = mean,
sd = sd,
n_samples = n_samples,
ref_condition = 'all')
The volcano plot looks strange for the result of the differential abundance analysis with t-test_mean_sd, probably because the means and standard deviations above were computed per condition across all proteins rather than per protein.
COEFFICIENT OF VARIATION
When we run the code below, we get a data frame with 5138 rows per sample.
lfq_long <- read.csv('./data/LFQ_long_raw.csv')
lfq_long %>%
select(Sample) %>%
filter(grepl('TOTALS', Sample) & (grepl('22', Sample) | grepl('23', Sample) | grepl('24', Sample)))%>%
group_by(Sample) %>%
summarise(test = length(Sample))
## # A tibble: 6 × 2
## Sample test
## <chr> <int>
## 1 KO_TOTALS_22 5138
## 2 KO_TOTALS_23 5138
## 3 KO_TOTALS_24 5138
## 4 WT_TOTALS_22 5138
## 5 WT_TOTALS_23 5138
## 6 WT_TOTALS_24 5138
But when we read the original data from proteinGroups.txt and apply the basic filtering, we get 5905 rows.